A Universal Part-of-Speech Tagset
Abstract
To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
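The mapping idea in the abstract can be pictured as a per-treebank lookup table from fine-grained tags to the twelve coarse categories. The following is a minimal sketch under stated assumptions, not the authors' released mapping: the category names follow the commonly distributed version of the tagset, the Penn Treebank pairs are a small illustrative subset, and the helper `to_universal` and its `fallback` parameter are hypothetical names introduced here.

```python
# Assumed names for the twelve universal categories (verify against the
# released mapping files before relying on them).
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Illustrative subset of a Penn Treebank -> universal mapping (assumption:
# a real mapping covers every tag in the source tagset).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".", "FW": "X",
}


def to_universal(tagged, mapping, fallback="X"):
    """Convert (word, fine_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the mapping fall back to the catch-all category.
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged]


sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
converted = to_universal(sentence, PTB_TO_UNIVERSAL)
print(converted)
```

One such table per treebank is enough to bring differently annotated corpora to a shared label set, which is what allows the combined resource to be compared across languages.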
Similar resources
Universal Dependencies for Japanese
We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon Un...
Reusable Tagset Conversion Using Tagset Drivers
Part-of-speech or morphological tags are important means of annotation in a vast number of corpora. However, different sets of tags are used in different corpora, even for the same language. Tagset conversion is difficult, and solutions tend to be tailored to a particular pair of tagsets. We propose a universal approach that makes the conversion tools reusable. We also provide an indirect evalu...
What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages
In this paper we address the problem of multilingual part-of-speech tagging for resource-poor languages. We use parallel data to transfer part-of-speech information from resource-rich to resource-poor languages. Additionally, we use a small amount of annotated data to learn to “correct” errors from the projected approach, such as tagset mismatch between languages, achieving state-of-the-art performan...
Internal and external tagsets in part-of-speech tagging
We present an approach to statistical part-of-speech tagging that uses two different tagsets, one for its internal and one for its external representation. The internal tagset is used in the underlying Markov model, while the external tagset constitutes the output of the tagger. The internal tagset can be modified and optimized to increase tagging accuracy (with respect to the external tagset). We...
Part-of-speech Tagging in French Te Experiments in Tagset
Part-of-speech tagging is needed for French Text-to-Speech (TTS) synthesis to disambiguate the pronunciation of homograph heterophones, liaison instances, and eventually to model intonational contours. A core problem in the part-of-speech tagging in French TTS is to decide on the tagset used for the tagger and the tagset needed by TTS. We carried out a number of experiments on several sizes of ...